
Optimize CmSketch AVX block access and in-register updates for frequency estimate and increment #783

Draft
Copilot wants to merge 8 commits into main from copilot/optimize-cmsketch-using-intrinsics

Conversation

Contributor

Copilot AI commented Apr 19, 2026

CmSketchCore's AVX2 paths were still doing per-counter gathers and scalar table updates. This change restructures both hot paths to operate on the 64-byte sketch block as two contiguous Vector256<long> loads, reducing memory traffic and keeping counter selection/update local to the loaded vectors.

  • What changed

    • EstimateFrequencyAvx now reads the target block once as two consecutive Vector256<long> values and extracts the four candidate counters from those vectors instead of gathering four independent longs (a sketch of this follows the list below).
    • IncrementAvx now follows the same block-oriented pattern: load both halves of the block, compute the selected counter updates in-register, then write the updated block back with two contiguous vector stores.
    • On runtimes where the sketch table pointer is pre-aligned, the AVX path now uses aligned loads and stores.
    • The IncrementAvx update path no longer drops back to scalar lane extraction/reconstruction: lane selection, saturation checks, and increments are now performed with vector masks and variable shifts.
  • Hot-path shape

    • Replaced gather-based counter access with direct block reads from block and block + 4.
    • Preserved the existing hash/indexing layout and saturating 4-bit counter semantics.
    • Kept the update logic explicit per lane so the AVX path still maps exactly to the sketch’s two-slot-per-segment layout.
    • Corrected the pre-aligned table pointer math so aligned AVX accesses are safe.
  • Resulting AVX flow

    Vector256<long> lower = Avx.LoadAlignedVector256(tablePtr + block);
    Vector256<long> upper = Avx.LoadAlignedVector256(tablePtr + block + 4);
    
    // select the target counters from the loaded block halves,
    // compute saturation masks and increments in vector registers,
    // then store the updated halves back contiguously
    Avx.StoreAligned(tablePtr + block, lower);
    Avx.StoreAligned(tablePtr + block + 4, upper);
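
To make the estimate side concrete, here is a minimal sketch, assuming per-lane nibble shifts are already computed; tablePtr, block, and shifts are illustrative names (not the exact CmSketchCore code), and only the lower block half is shown. It pulls the four candidate 4-bit counters out of a loaded half and reduces to their minimum entirely in vector registers; AVX2 has no 64-bit integer min, so the reduction emulates one with compare-and-blend:

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static unsafe int EstimateSketch(long* tablePtr, int block, Vector256<ulong> shifts)
    {
        Vector256<long> lower = Avx.LoadVector256(tablePtr + block);

        // Shift each selected counter down to bit 0 and mask to 4 bits.
        Vector256<long> counts = Avx2.And(
            Avx2.ShiftRightLogicalVariable(lower, shifts),
            Vector256.Create(0xfL));

        // Horizontal min via compare + blend: swap the 128-bit halves and
        // take the per-lane min, then swap within halves and min again.
        Vector256<long> swapped = Avx2.Permute4x64(counts, 0b01_00_11_10);
        counts = Avx2.BlendVariable(counts, swapped, Avx2.CompareGreaterThan(counts, swapped));
        swapped = Avx2.Permute4x64(counts, 0b10_11_00_01);
        counts = Avx2.BlendVariable(counts, swapped, Avx2.CompareGreaterThan(counts, swapped));

        return (int)counts.GetElement(0);
    }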

Copilot AI and others added 2 commits April 19, 2026 07:00
Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/7289ae1d-0fa0-43fd-bbe1-2511d41eb3d9

Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/7289ae1d-0fa0-43fd-bbe1-2511d41eb3d9

Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
@coveralls

coveralls commented Apr 19, 2026

Coverage Status

coverage: 99.151% (+0.01%) from 99.14% — copilot/optimize-cmsketch-using-intrinsics into main

Comment thread BitFaster.Caching/Lfu/CmSketchCore.cs Outdated
#endif
{
Vector128<ushort> count = Avx2.PermuteVar8x32(Avx2.And(Avx2.ShiftRightLogicalVariable(Avx2.GatherVector256(tablePtr, blockOffset, 8), indexLong), Vector256.Create(0xfL)).AsInt32(), Vector256.Create(0, 2, 4, 6, 1, 3, 5, 7))
Vector256<long> lower = Avx.LoadVector256(tablePtr + block);
Owner

@copilot can this use the load aligned vector method instead to make it faster?

Contributor Author

Updated in cfbd9ac to use aligned AVX loads/stores on the pre-aligned table pointer, and corrected the pointer alignment math so those aligned accesses are safe.
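
For reference, a minimal sketch of that alignment math, assuming the table is over-allocated and the base pointer is rounded up (illustrative names, not the exact CmSketchCore fields):

    // Round a raw pointer up to the next 32-byte boundary. The buffer must
    // carry at least 3 longs (24 bytes) of slack so the rounded-up pointer
    // still has room for the full table.
    static unsafe long* Align32(long* raw)
    {
        return (long*)(((nuint)raw + 31) & ~(nuint)31);
    }

With the base pointer aligned this way, every 64-byte block starts on a 32-byte boundary, so Avx.LoadAlignedVector256 and Avx.StoreAligned cannot fault.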

Copilot AI requested a review from bitfaster April 20, 2026 01:58
Comment thread BitFaster.Caching/Lfu/CmSketchCore.cs Outdated
Vector256<long> upper = Avx.LoadVector256(tablePtr + block + 4);
#endif

long lower0 = lower.GetElement(0);
Owner

@copilot can these operations be vectorized so that all computation is done in vector registers?

Contributor Author

Yes — updated in 07f4d28 so the IncrementAvx lane selection, saturation check, and counter increment stay in vector registers, with only the final block stores leaving the registers.

Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/c76dc857-c67e-4ce6-a05f-2ebb894fcc0b

Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
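
A rough sketch of the shape described in that reply, under the same assumptions as the earlier snippet (tablePtr, block, and shifts are illustrative names; only the lower block half is shown):

    // Saturating 4-bit increment kept in vector registers: extract the
    // selected nibbles, mask off lanes already at 0xF, then add 1 << shift
    // only in the unsaturated lanes before storing the half back.
    static unsafe void IncrementSketch(long* tablePtr, int block, Vector256<ulong> shifts)
    {
        Vector256<long> lower = Avx.LoadVector256(tablePtr + block);

        Vector256<long> nibbles = Avx2.And(
            Avx2.ShiftRightLogicalVariable(lower, shifts),
            Vector256.Create(0xfL));
        Vector256<long> saturated = Avx2.CompareEqual(nibbles, Vector256.Create(0xfL));

        // AndNot(saturated, x) computes ~saturated & x, zeroing the
        // increment in lanes whose counter is already at the 0xF ceiling.
        Vector256<long> increment = Avx2.AndNot(
            saturated,
            Avx2.ShiftLeftLogicalVariable(Vector256.Create(1L), shifts));

        Avx.Store(tablePtr + block, Avx2.Add(lower, increment));
    }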
Copilot AI changed the title from "Optimize CmSketch AVX block access for frequency estimate and increment" to "Optimize CmSketch AVX block access and in-register updates for frequency estimate and increment" Apr 20, 2026
Copilot AI requested a review from bitfaster April 20, 2026 02:06
@bitfaster
Owner

From iteration 1

BenchmarkDotNet v0.15.8, Windows 11 (10.0.26200.8246/25H2/2025Update/HudsonValley2)
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
  [Host]   : .NET Framework 4.8.1 (4.8.9325.0), X64 RyuJIT VectorSize=256
  .NET 6.0 : .NET 9.0.15 (9.0.15, 9.0.1526.17522), X64 RyuJIT x86-64-v4
  .NET 8.0 : .NET 9.0.15 (9.0.15, 9.0.1526.17522), X64 RyuJIT x86-64-v4
  .NET 9.0 : .NET 9.0.15 (9.0.15, 9.0.1526.17522), X64 RyuJIT x86-64-v4
Method Size Mean Error StdDev Ratio
FrequencyFlat 32768 16.45 ns 0.165 ns 0.146 ns 1.00
FrequencyFlatAvx 32768 30.63 ns 0.166 ns 0.155 ns 1.86
FrequencyBlock 32768 20.73 ns 0.043 ns 0.036 ns 1.26
FrequencyBlockUnroll 32768 18.37 ns 0.061 ns 0.054 ns 1.12
FrequencyBlockAvxNotPinned 32768 20.43 ns 0.095 ns 0.084 ns 1.24
FrequencyBlockAvxPinned 32768 17.26 ns 0.069 ns 0.062 ns 1.05
FrequencyFlat 32768 16.27 ns 0.056 ns 0.044 ns 1.00
FrequencyFlatAvx 32768 30.55 ns 0.117 ns 0.098 ns 1.88
FrequencyBlock 32768 20.78 ns 0.132 ns 0.117 ns 1.28
FrequencyBlockUnroll 32768 18.37 ns 0.049 ns 0.041 ns 1.13
FrequencyBlockAvxNotPinned 32768 20.41 ns 0.038 ns 0.032 ns 1.25
FrequencyBlockAvxPinned 32768 19.36 ns 0.068 ns 0.060 ns 1.19
FrequencyFlat 32768 16.70 ns 0.318 ns 0.312 ns 1.00
FrequencyFlatAvx 32768 30.74 ns 0.210 ns 0.196 ns 1.84
FrequencyBlock 32768 21.27 ns 0.179 ns 0.140 ns 1.27
FrequencyBlockUnroll 32768 18.12 ns 0.084 ns 0.078 ns 1.09
FrequencyBlockAvxNotPinned 32768 20.68 ns 0.206 ns 0.193 ns 1.24
FrequencyBlockAvxPinned 32768 18.02 ns 0.360 ns 0.611 ns 1.08
FrequencyFlat 524288 38.05 ns 1.074 ns 2.957 ns 1.01
FrequencyFlatAvx 524288 42.55 ns 0.811 ns 1.601 ns 1.12
FrequencyBlock 524288 34.24 ns 0.679 ns 1.715 ns 0.91
FrequencyBlockUnroll 524288 28.67 ns 0.453 ns 0.424 ns 0.76
FrequencyBlockAvxNotPinned 524288 52.86 ns 0.926 ns 0.821 ns 1.40
FrequencyBlockAvxPinned 524288 30.97 ns 1.288 ns 3.715 ns 0.82
FrequencyFlat 524288 35.08 ns 0.689 ns 1.053 ns 1.00
FrequencyFlatAvx 524288 43.93 ns 0.986 ns 2.782 ns 1.25
FrequencyBlock 524288 35.57 ns 1.046 ns 2.983 ns 1.01
FrequencyBlockUnroll 524288 28.68 ns 0.482 ns 0.428 ns 0.82
FrequencyBlockAvxNotPinned 524288 53.60 ns 1.062 ns 1.803 ns 1.53
FrequencyBlockAvxPinned 524288 26.89 ns 0.371 ns 0.310 ns 0.77
FrequencyFlat 524288 35.61 ns 0.703 ns 1.468 ns 1.00
FrequencyFlatAvx 524288 44.34 ns 1.399 ns 3.967 ns 1.25
FrequencyBlock 524288 36.07 ns 0.881 ns 2.498 ns 1.01
FrequencyBlockUnroll 524288 31.72 ns 1.006 ns 2.821 ns 0.89
FrequencyBlockAvxNotPinned 524288 48.90 ns 0.952 ns 0.977 ns 1.38
FrequencyBlockAvxPinned 524288 25.99 ns 0.511 ns 0.796 ns 0.73
FrequencyFlat 8388608 120.57 ns 1.145 ns 1.071 ns 1.00
FrequencyFlatAvx 8388608 153.23 ns 1.518 ns 1.185 ns 1.27
FrequencyBlock 8388608 107.97 ns 0.559 ns 0.495 ns 0.90
FrequencyBlockUnroll 8388608 105.11 ns 0.513 ns 0.428 ns 0.87
FrequencyBlockAvxNotPinned 8388608 93.45 ns 0.658 ns 0.616 ns 0.78
FrequencyBlockAvxPinned 8388608 72.32 ns 0.360 ns 0.319 ns 0.60
FrequencyFlat 8388608 119.37 ns 0.748 ns 0.700 ns 1.00
FrequencyFlatAvx 8388608 155.80 ns 2.917 ns 2.865 ns 1.31
FrequencyBlock 8388608 108.23 ns 0.864 ns 0.808 ns 0.91
FrequencyBlockUnroll 8388608 105.65 ns 1.110 ns 1.038 ns 0.89
FrequencyBlockAvxNotPinned 8388608 92.09 ns 0.304 ns 0.269 ns 0.77
FrequencyBlockAvxPinned 8388608 71.91 ns 0.371 ns 0.329 ns 0.60
FrequencyFlat 8388608 119.79 ns 1.043 ns 0.976 ns 1.00
FrequencyFlatAvx 8388608 153.82 ns 1.917 ns 1.793 ns 1.28
FrequencyBlock 8388608 108.09 ns 0.729 ns 0.646 ns 0.90
FrequencyBlockUnroll 8388608 105.40 ns 1.012 ns 0.897 ns 0.88
FrequencyBlockAvxNotPinned 8388608 102.15 ns 1.696 ns 1.503 ns 0.85
FrequencyBlockAvxPinned 8388608 71.71 ns 0.284 ns 0.222 ns 0.60
FrequencyFlat 134217728 151.15 ns 2.331 ns 2.180 ns 1.00
FrequencyFlatAvx 134217728 192.26 ns 2.352 ns 2.085 ns 1.27
FrequencyBlock 134217728 139.12 ns 2.636 ns 2.201 ns 0.92
FrequencyBlockUnroll 134217728 135.43 ns 2.656 ns 2.354 ns 0.90
FrequencyBlockAvxNotPinned 134217728 126.05 ns 2.497 ns 3.418 ns 0.83
FrequencyBlockAvxPinned 134217728 87.72 ns 1.664 ns 2.825 ns 0.58
FrequencyFlat 134217728 152.17 ns 2.978 ns 5.738 ns 1.00
FrequencyFlatAvx 134217728 191.58 ns 1.153 ns 0.900 ns 1.26
FrequencyBlock 134217728 138.36 ns 2.228 ns 1.975 ns 0.91
FrequencyBlockUnroll 134217728 138.88 ns 2.746 ns 2.569 ns 0.91
FrequencyBlockAvxNotPinned 134217728 128.71 ns 1.928 ns 1.709 ns 0.85
FrequencyBlockAvxPinned 134217728 89.27 ns 1.756 ns 3.384 ns 0.59
FrequencyFlat 134217728 149.08 ns 2.330 ns 1.946 ns 1.00
FrequencyFlatAvx 134217728 192.35 ns 1.971 ns 1.646 ns 1.29
FrequencyBlock 134217728 139.65 ns 2.613 ns 3.664 ns 0.94
FrequencyBlockUnroll 134217728 134.90 ns 2.685 ns 3.935 ns 0.91
FrequencyBlockAvxNotPinned 134217728 125.91 ns 2.248 ns 4.792 ns 0.84
FrequencyBlockAvxPinned 134217728 87.25 ns 1.586 ns 1.325 ns 0.59

@bitfaster
Owner

From commit 6

Method Size Mean Error StdDev Ratio
FrequencyFlat 32768 16.19 ns 0.111 ns 0.099 ns 1.00
FrequencyFlatAvx 32768 30.23 ns 0.110 ns 0.092 ns 1.87
FrequencyBlock 32768 20.54 ns 0.063 ns 0.059 ns 1.27
FrequencyBlockUnroll 32768 18.19 ns 0.068 ns 0.057 ns 1.12
FrequencyBlockAvxNotPinned 32768 20.18 ns 0.028 ns 0.024 ns 1.25
FrequencyBlockAvxPinned 32768 17.22 ns 0.210 ns 0.175 ns 1.06
FrequencyFlat 32768 17.76 ns 0.305 ns 0.456 ns 1.00
FrequencyFlatAvx 32768 30.88 ns 0.325 ns 0.304 ns 1.74
FrequencyBlock 32768 24.59 ns 1.473 ns 4.155 ns 1.39
FrequencyBlockUnroll 32768 18.65 ns 0.373 ns 0.643 ns 1.05
FrequencyBlockAvxNotPinned 32768 20.60 ns 0.335 ns 0.470 ns 1.16
FrequencyBlockAvxPinned 32768 17.05 ns 0.117 ns 0.098 ns 0.96
FrequencyFlat 32768 17.08 ns 0.060 ns 0.053 ns 1.00
FrequencyFlatAvx 32768 30.29 ns 0.186 ns 0.156 ns 1.77
FrequencyBlock 32768 20.46 ns 0.069 ns 0.065 ns 1.20
FrequencyBlockUnroll 32768 18.06 ns 0.019 ns 0.016 ns 1.06
FrequencyBlockAvxNotPinned 32768 20.13 ns 0.058 ns 0.049 ns 1.18
FrequencyBlockAvxPinned 32768 17.28 ns 0.034 ns 0.028 ns 1.01
FrequencyFlat 524288 33.33 ns 0.694 ns 1.934 ns 1.00
FrequencyFlatAvx 524288 40.40 ns 0.795 ns 0.664 ns 1.22
FrequencyBlock 524288 32.14 ns 0.628 ns 1.697 ns 0.97
FrequencyBlockUnroll 524288 27.30 ns 0.534 ns 0.593 ns 0.82
FrequencyBlockAvxNotPinned 524288 46.27 ns 0.534 ns 0.417 ns 1.39
FrequencyBlockAvxPinned 524288 24.38 ns 0.455 ns 0.541 ns 0.73
FrequencyFlat 524288 34.50 ns 0.554 ns 0.519 ns 1.00
FrequencyFlatAvx 524288 38.88 ns 0.573 ns 0.447 ns 1.13
FrequencyBlock 524288 30.78 ns 0.377 ns 0.294 ns 0.89
FrequencyBlockUnroll 524288 26.91 ns 0.496 ns 0.440 ns 0.78
FrequencyBlockAvxNotPinned 524288 42.96 ns 0.850 ns 0.710 ns 1.25
FrequencyBlockAvxPinned 524288 24.28 ns 0.281 ns 0.375 ns 0.70
FrequencyFlat 524288 32.01 ns 0.629 ns 0.673 ns 1.00
FrequencyFlatAvx 524288 68.23 ns 1.271 ns 2.053 ns 2.13
FrequencyBlock 524288 30.67 ns 0.378 ns 0.335 ns 0.96
FrequencyBlockUnroll 524288 26.72 ns 0.365 ns 0.324 ns 0.84
FrequencyBlockAvxNotPinned 524288 43.60 ns 0.852 ns 0.982 ns 1.36
FrequencyBlockAvxPinned 524288 28.28 ns 1.304 ns 3.784 ns 0.88
FrequencyFlat 8388608 120.52 ns 1.784 ns 1.669 ns 1.00
FrequencyFlatAvx 8388608 132.75 ns 0.927 ns 0.867 ns 1.10
FrequencyBlock 8388608 110.80 ns 2.204 ns 2.263 ns 0.92
FrequencyBlockUnroll 8388608 106.11 ns 1.131 ns 1.002 ns 0.88
FrequencyBlockAvxNotPinned 8388608 94.72 ns 1.855 ns 2.888 ns 0.79
FrequencyBlockAvxPinned 8388608 65.02 ns 1.147 ns 0.958 ns 0.54
FrequencyFlat 8388608 120.59 ns 1.143 ns 0.893 ns 1.00
FrequencyFlatAvx 8388608 132.03 ns 0.819 ns 0.684 ns 1.09
FrequencyBlock 8388608 107.73 ns 0.832 ns 0.778 ns 0.89
FrequencyBlockUnroll 8388608 105.49 ns 1.243 ns 1.163 ns 0.87
FrequencyBlockAvxNotPinned 8388608 87.32 ns 1.045 ns 0.978 ns 0.72
FrequencyBlockAvxPinned 8388608 64.13 ns 0.713 ns 0.667 ns 0.53
FrequencyFlat 8388608 117.98 ns 0.416 ns 0.325 ns 1.00
FrequencyFlatAvx 8388608 131.67 ns 0.573 ns 0.508 ns 1.12
FrequencyBlock 8388608 109.12 ns 2.114 ns 1.874 ns 0.92
FrequencyBlockUnroll 8388608 111.62 ns 1.050 ns 0.982 ns 0.95
FrequencyBlockAvxNotPinned 8388608 86.60 ns 0.570 ns 0.476 ns 0.73
FrequencyBlockAvxPinned 8388608 63.74 ns 0.339 ns 0.300 ns 0.54
FrequencyFlat 134217728 149.48 ns 2.962 ns 3.746 ns 1.00
FrequencyFlatAvx 134217728 170.21 ns 0.996 ns 0.832 ns 1.14
FrequencyBlock 134217728 136.37 ns 2.235 ns 1.867 ns 0.91
FrequencyBlockUnroll 134217728 150.60 ns 4.387 ns 12.865 ns 1.01
FrequencyBlockAvxNotPinned 134217728 118.13 ns 2.209 ns 4.610 ns 0.79
FrequencyBlockAvxPinned 134217728 83.86 ns 2.241 ns 6.538 ns 0.56
FrequencyFlat 134217728 149.18 ns 2.828 ns 3.026 ns 1.00
FrequencyFlatAvx 134217728 173.91 ns 3.301 ns 3.669 ns 1.17
FrequencyBlock 134217728 138.82 ns 2.418 ns 3.058 ns 0.93
FrequencyBlockUnroll 134217728 132.66 ns 2.228 ns 1.975 ns 0.89
FrequencyBlockAvxNotPinned 134217728 114.74 ns 2.092 ns 1.747 ns 0.77
FrequencyBlockAvxPinned 134217728 92.21 ns 5.371 ns 15.668 ns 0.62
FrequencyFlat 134217728 150.32 ns 2.597 ns 3.284 ns 1.00
FrequencyFlatAvx 134217728 172.57 ns 2.700 ns 2.393 ns 1.15
FrequencyBlock 134217728 136.90 ns 2.691 ns 4.346 ns 0.91
FrequencyBlockUnroll 134217728 131.44 ns 2.117 ns 1.768 ns 0.87
FrequencyBlockAvxNotPinned 134217728 124.26 ns 3.825 ns 11.159 ns 0.83
FrequencyBlockAvxPinned 134217728 80.61 ns 1.504 ns 3.073 ns 0.54

Comment thread BitFaster.Caching/Lfu/CmSketchCore.cs Outdated
int lane2 = h2 & 1;
int lane3 = (h3 & 1) + 2;

Vector256<ulong> lowerIndex = Vector256.Create(
Owner

@copilot can this be written so that the values are loaded into vector registers without any branches, and there is a mask for the lanes that sets values to zero after they are loaded. That way all of this can be expressed mainly with vector assign and masking without the branches.

Contributor Author

Updated in aabff69 to remove the branchy lane setup: the block halves are still loaded directly into vector registers, and the active lanes are now selected with vector compare masks so inactive lanes are zeroed by masking in-register.
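
A minimal sketch of that masking pattern (SelectLane and its parameters are illustrative, not the library's API):

    // Keep only 64-bit lane 'lane' (0..3) of a loaded block half: the
    // compare produces all-ones in the matching lane and zeros elsewhere,
    // so the inactive lanes are cleared without any branch.
    static Vector256<long> SelectLane(Vector256<long> half, int lane)
    {
        Vector256<long> laneIds = Vector256.Create(0L, 1L, 2L, 3L);
        Vector256<long> mask = Avx2.CompareEqual(laneIds, Vector256.Create((long)lane));
        return Avx2.And(half, mask);
    }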

Copilot AI requested a review from bitfaster April 21, 2026 07:27